Purpose: The goal of this notebook is to discuss topics surrounding hypothesis testing in frequentist statistics. This notebook will be mostly theoretical curriculum.

<b> Introduction to Hypothesis Testing in Frequentist Statistics </b>

Hypothesis testing is how we tell if there is a meaningful difference between two or more groups. There are many tests to do this. However, a test does not prove or disprove a hypothesis. The main reason is that science does not prove anything as factual or objective; science seeks to provide evidence for a concept that must be falsifiable. See Karl Popper's "Logic of Scientific Discovery" for the original material.

The issue now is how to organize a hypothesis test. We need to collect data (experimental units) on variables, often in experimental groups. The names proliferate, but you will often hear of dependent variables (measured, result), independent variables (manipulated, explanatory), control variables (stable), exogenous variables, confounding variables (biasing), nuisance, lurking, moderating (suppressing, promoting), mediating, quantitative (discrete or continuous; interval or ratio), qualitative (nominal, ordinal), and latent variables, as well as control groups, manipulated groups, and much more.

This becomes very complex, and one can spend a whole career on experimental design. The key point here is that we should organize our groups and variables in such a way that (1.) we measure what we think we are measuring and (2.) we control for error that might cause noise in our measure.

As an advanced topic, we do not have to control for error in the experimental design. An alternative is to control for it statistically, with one example being a confound entered as a covariate in an ANCOVA; however, this will result in less statistical power when compared to experimental control.

Video: https://www.youtube.com/watch?v=0oc49DyA3hU&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=9

"Think of how [dumb] the average person is, and realize half of them are [dumber] than that." - George Carlin

<b> Terms of Frequentism: Significance and p-values </b>

P-values tell us how surprising our data would be if nothing but random chance were at work. More precisely, a p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis (no effect) is true. Before looking at the data, we pick a significance threshold, also known as alpha; .05 is the usual choice. More on alpha later. If a p-value from a statistical analysis is less than .05 (if that is our preferred threshold), then we have found significance. In everyday terms, this means there is something interesting happening in the data. If we properly controlled the data collection, then the manipulation is the most plausible explanation for whatever interesting thing happened. In other words, if there were truly no effect and we conducted this experiment 100 times, we would expect to wrongly declare significance about 5 times.

Consider this arbitrary example. Two teams are studying for a test. Team 1 studies individually and x-bar is equal to 77. Team 2 studies in peer settings and their average is 98. We conduct a t-test to compare them (more on general linear models later) and a p-value of .0045 is found. This means that there is something interesting in the data, and, provided the teams were otherwise comparable, peer learning is the likely explanation.

This can also be done with correlations and other tests.
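To make the two-team example concrete, here is a minimal sketch of how a p-value can be obtained. It uses a permutation test rather than the t-test mentioned above, only because that runs on the Python standard library. The individual scores are invented so that the group means come out to 77 and 98, and `permutation_p_value` is a hypothetical helper name.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_permutations):
        # Under the null, group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            count += 1
    # Add 1 to both parts so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)

# Hypothetical scores, invented for illustration only.
individual = [72, 75, 78, 80, 74, 79, 77, 81]    # mean 77
peer = [95, 99, 97, 100, 98, 96, 99, 100]        # mean 98

p = permutation_p_value(individual, peer)
print(f"p = {p:.4f}")
```

The p-value here is the share of label shufflings that produce a gap at least as large as the one we observed. A small share means the observed gap would be surprising if study method made no difference.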

Video (p-values): https://www.youtube.com/watch?v=vemZtEM63GY&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=11
        
Video (calculate): https://www.youtube.com/watch?v=JQc3yx0-Q9E&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=12

"You need extraordinary evidence for extraordinary hypotheses" – Laplace, 1812

<b> Alpha (significance threshold, type I error rate) and Beta (type II error rate; power is 1-beta) </b>

Remember that statistics help humans reason with probabilities; we do not do this very well naturally. A significance level, typically alpha = .05, is the threshold we use to decide whether a result counts as significant. HOWEVER, consider the importance of what you are testing. If you are doing business analytics for some simple exploration, an alpha level of .1 might be reasonable. If you are measuring healthcare or trying to justify something bizarre, you want a very small alpha, such as .01 (a 1% chance of a type 1 error). Thus, alpha is the probability of a significant result when H0 is true.
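A quick simulation can show what "alpha is the probability of a significant result when H0 is true" looks like in practice. This is a sketch under simplifying assumptions: both groups are drawn from the same normal distribution (so H0 is true by construction), the standard deviation is treated as known, and a z-test stands in for whatever test you would really use. The function names are hypothetical.

```python
import random
from statistics import NormalDist

def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided two-sample z-test p-value, assuming a known common sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    se = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha, n_per_group=30, n_experiments=4000, seed=1):
    """Share of null experiments (no true difference) that reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(a, b) < alpha:
            hits += 1
    return hits / n_experiments

for alpha in (0.10, 0.05, 0.01):
    rate = false_positive_rate(alpha)
    print(f"alpha = {alpha:.2f} -> false-positive rate = {rate:.3f}")
```

Each observed rate should land close to the alpha that was chosen, which is the sense in which alpha controls the type 1 error rate.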

Beta is the probability of a nonsignificant result given that the alternative is true, also called a type 2 error. 1-beta = statistical power (the probability of a significant result when the alternative is true). In other words, it is how likely you are to detect a significant effect when there really is an effect. This number is not set by convention as often as alpha is (more on this later; it is part of the Fisher and Neyman-Pearson debate). Power is influenced by things like the choice of test, the number of items in a measure, specifying a 1-tailed over a 2-tailed test (at the expense of more robust findings in both tails), or, most commonly, by increasing the number of experimental units (the sample).
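The link between sample size and power can be sketched with a normal approximation. This assumes a two-sided, two-sample z-test with a standardized effect size (Cohen's d) and equal group sizes; tools such as GPower use t distributions, so their numbers will differ slightly. `z_test_power` is a hypothetical name.

```python
from statistics import NormalDist

def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

for n in (20, 40, 64, 100):
    print(f"n per group = {n:3d} -> power = {z_test_power(0.5, n):.3f}")
```

With a medium effect (d = .5), power climbs as the sample grows, and beta is simply 1 minus each printed value. Setting the effect size to 0 returns alpha, since "power" under the null is just the type 1 error rate.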

Video: https://www.youtube.com/watch?v=Rsc5znwR5FA&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=14

Video: https://www.youtube.com/watch?v=bsZGt-caXO4&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=35

Bonus: It is worth your time to download GPower and use the calculator to compute power for some tests.

Reading: https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower

Video: https://www.youtube.com/watch?v=VX_M3tIyiYk&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=15

Reading: Type 1 Errors Inflated: Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. http://doi.org/10.1177/0956797611417632

Reading: Type 1 Error Control: Rutherford, A. (2011). ANOVA and ANCOVA: a GLM approach (2nd ed). Hoboken, N.J: Wiley. Sections 3.6-3.10

Reading: Introduction to Power: Cohen, J. (1992). Statistical power analysis. Current Directions in Psychological Science, 1(3), 98–101.

There is also some debate over whether a post-hoc power analysis is appropriate. When designing an experiment, it is required that you determine your power, and something like GPower will give you the sample size. However, some statisticians argue that you should compute power after the data collection too, as your sample size is probably not exact.

Advanced: You can also improve power through advanced methods. One example is a within-groups (repeated measures) design in which the correlation between the repeated measures is greater than .5.

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." H.G.Wells

![Error%20Chart.PNG](attachment:Error%20Chart.PNG)

A trending concept in science is the idea of Pre-Registration and Open Science. This ensures scientists are being fair and open with their methods. The process involves: specify what you want to do, ensure a plan of action, do not hypothesize after collecting results, justify your sample size, IV, and DV, and describe the process. This helps eliminate p-hacking.

Advanced: If our p-value does not fall below alpha, we can follow up with an equivalence test, Bayes factors, or Bayesian estimation to assess whether the data actually support the absence of a meaningful effect.

<b> Bonus: Return of the p-values: P-Value Debate and P-Hacking </b>

p-hacking refers to manipulating statistics and hypotheses to find significance. This is facilitated by people wanting to find meaningful differences in their data. However, this is bad because the analyzer is biased by the data, and it can lead to false findings in publications. Scientists must work together to optimize the confidence of their findings for the general use. The issue is data collection is complex. Anyone with a basic understanding can manipulate their design, data, analysis, or assumptions into being superficially accepted as "good evidence."

Video: https://www.youtube.com/watch?v=HDCOUXE3HMM&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=13

Video: https://www.youtube.com/watch?v=Zpu3VtnRhLw (This is a great channel for statistical philosophy)

HARKing: Hypothesizing After the Results are Known is not always considered p-hacking, but it is argued that HARKing is not testing a hypothesis, because you have already seen the data before stating it.

"There are lies, damned lies, and statistics... Facts are stubborn, but statistics are more pliable." - Mark Twain

Frequentist statistics is characterized by its use of p-values. You will not find this in other areas such as Bayesian statistics. The question is, do p-values make sense in statistics? Hopefully this document explains these concepts clearly enough that you think statistics are okay. My general view is that they are very useful, partially because they are the best we have, but also because they do give a decent view of what is going on. Next, we will get into the debate over p-values.

In summary, Fisher suggested that p-values were a continuous measure of evidence against the null hypothesis, whereas Neyman and Pearson suggested that p-values should be compared against a fixed level (alpha) chosen in advance.

To contextualize this let us look at the p-values of test 1 (p=.049) and test 2 (p=.051).

Fisher would interpret this as having roughly the same amount of significance.

Neyman-Pearson would suggest that only test 1 matters.

Article: https://www.frontiersin.org/articles/10.3389/fpsyg.2015.00223/full

This is a pretty serious matter because so much of our statistics revolves around p-values. I generally observe a Neyman-Pearson approach in formal work. However, I think Fisher does have some important points here. Consider an intentionally exaggerated medical situation. The contemporary treatment gives people a 95% chance of living. Now you find a new intervention with a p-value of .01. Suppose that if you gave this treatment to 100 patients, 1 of them would die because the treatment was a misdiagnosis. Do you say the lives of the 99 justify the treatment over having no treatment at all, despite the 1 being unjustly harmed due to error in the intervention? I think these ideas are explored really well in Harvard's Justice course (episode 1), available on YouTube.

Video 1: https://www.youtube.com/watch?v=bf3egy7TQ2Q&list=PLH2l6uzC4UEW3iJO4T0qUeUEp_X-f1U7S&index=22&pp=iAQB

Video 2: https://www.youtube.com/watch?v=WWagtGT1zH4&list=PLH2l6uzC4UEW3iJO4T0qUeUEp_X-f1U7S&index=24&pp=iAQB

"... the actual and physical conduct of an experiment must govern the statistical procedure of its interpretation." R. A. Fisher

<b> Bonus: Ethical P-Interpretation </b>

1.	P values tell you about DATA, not theories.
2.	Misinterpretation 1: A non-significant p-value means the null hypothesis is true.
a.	P values tell you when something surprising is in the data, not when a hypothesis is correct or incorrect.
3.	Misinterpretation 2: A significant p value means that the null hypothesis is false.
a.	It is best to say “Statistical significance was found with the understanding that there is a 5% chance of a type 1 error.”
4.	Misinterpretation 3: A significant p value means there is practical importance.
a.	To know if an effect matters (significance, correlation, etc), you need to conduct a cost benefit analysis. 
5.	Misinterpretation 4: If you have observed a significant finding, the probability that you have made a Type 1 error is 5%. 
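a.	Note the difference between two questions: (1) what is the probability that the null hypothesis is true, given that I have observed p < .05, and (2) what is the probability of observing this or more extreme data, given that the null hypothesis is true?
b.	The difference is that the p-value answers only the second question, which assumes the null hypothesis is true. The first question also depends on how plausible the null hypothesis was before collecting data.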
6.	Misinterpretation 5: 1-p = the probability an effect will be replicated. 
a.	Probability for detecting significance is determined by power, not p value. 
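Misinterpretation 4 can be checked with a little arithmetic. The sketch below applies Bayes' rule to ask how often a significant result comes from a true null. The prior probability that an effect exists and the power are assumptions you have to supply, and the values used here are purely illustrative; `prob_null_given_significant` is a hypothetical name.

```python
def prob_null_given_significant(prior_h1, power, alpha=0.05):
    """P(H0 is true | p < alpha), by Bayes' rule.

    prior_h1: assumed probability that the effect is real before the study.
    power:    probability of a significant result when the effect is real.
    """
    true_positives = power * prior_h1
    false_positives = alpha * (1 - prior_h1)
    return false_positives / (false_positives + true_positives)

for prior_h1 in (0.5, 0.1):
    risk = prob_null_given_significant(prior_h1, power=0.8)
    print(f"prior P(H1) = {prior_h1:.1f} -> P(H0 | significant) = {risk:.3f}")
```

With a 50/50 prior the answer is about 6%, but with a long-shot hypothesis (10% prior) it rises to 36%, even though alpha is .05 in both cases. This is why a significant p-value alone does not tell you the chance that you made a type 1 error.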


More info: Greenland, Senn, Rothman, Carlin, Poole, Goodman, and Altman (2016), Making Null Meaningful (Harms & Lakens, 2018)

Although p-values do not tell us what is true, or even the nature of a theory, we can derive some support using a p-curve analysis. This poses the question: does a set of p-values have evidential value? To do this, we take the significant p-values from a line of literature or studies and plot their distribution to determine if there is an effect. See "p-curve.com" as an example. This can also tell you something about the reliability of a series of studies.
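The intuition behind a p-curve can be sketched with a simulation. This is not the procedure used by p-curve.com, just an illustration under simplifying assumptions: each "study" is reduced to a single z statistic, and `true_shift` (a hypothetical parameter) is the expected z value, with 0 meaning the null is true.

```python
import random
from statistics import NormalDist

def significant_p_values(true_shift, n_studies=20_000, alpha=0.05, seed=2):
    """Simulate two-sided z-test p-values and keep only the significant ones."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    kept = []
    for _ in range(n_studies):
        z = rng.gauss(true_shift, 1)
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            kept.append(p)
    return kept

def share_below(p_values, cutoff=0.025):
    """Fraction of the significant p-values that fall below the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)

null_curve = significant_p_values(true_shift=0.0)
effect_curve = significant_p_values(true_shift=2.8)
print(f"no effect:   share of significant p < .025 = {share_below(null_curve):.2f}")
print(f"real effect: share of significant p < .025 = {share_below(effect_curve):.2f}")
```

When there is no effect, significant p-values are spread evenly between 0 and .05, so about half fall below .025. When there is a real effect, they pile up near zero. A set of studies whose significant p-values cluster just under .05 is therefore a warning sign.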

Good Practices:
It is recommended that you do not truncate or round your statistical values. Doing so overestimates or underestimates the observed representation of the population.

It is also recommended that you are careful with "trimming" your data. Typically this falls around the idea of Dixon's Q test, which identifies whether a value is an outlier within its distribution. If one finds an outlier, they have three options: trim it (delete it), winsorize it (replace it with a percentile value), or do nothing. I encourage you to do nothing, because in my experience it is not appropriate to edit the data to meet some expectation. However, it is your job to investigate these topics and determine where you fall on this ethics issue. 
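To see what the three options do to a dataset, here is a small sketch. The Q statistic below is the basic gap-over-range form of Dixon's test; deciding whether a value is an outlier still requires comparing Q to a published critical value for your sample size, which is not reproduced here. The winsorizing shown replaces the extremes with the nearest remaining value, a simplification of percentile-based winsorizing. The data and function names are hypothetical.

```python
def dixon_q(values):
    """Dixon's Q for the most extreme value: gap to its neighbor / range."""
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]
    gap_low = ordered[1] - ordered[0]
    gap_high = ordered[-1] - ordered[-2]
    return max(gap_low, gap_high) / data_range

def trim(values, n_each_side=1):
    """Delete the n smallest and n largest values."""
    ordered = sorted(values)
    return ordered[n_each_side:len(ordered) - n_each_side]

def winsorize(values, n_each_side=1):
    """Pull the n smallest and n largest values in to the nearest kept value."""
    ordered = sorted(values)
    low, high = ordered[n_each_side], ordered[-n_each_side - 1]
    return [min(max(v, low), high) for v in values]

def average(values):
    return sum(values) / len(values)

scores = [12, 14, 15, 15, 16, 17, 18, 45]  # hypothetical data, one large value
print(f"Q statistic      = {dixon_q(scores):.3f}")
print(f"mean, untouched  = {average(scores):.2f}")
print(f"mean, trimmed    = {average(trim(scores)):.2f}")
print(f"mean, winsorized = {average(winsorize(scores)):.2f}")
```

Notice how much the mean moves depending on the choice, which is exactly why the decision deserves some ethical thought.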

"If ... we choose a group of social phenomena with no antecedent knowledge of the causation or absence of causation among them, then the calculation of correlation coefficients, total or partial, will not advance us a step toward evaluating the importance of the causes at work." R. A. Fisher

Video 3: https://www.youtube.com/watch?v=PPD8lER8ju4&list=PLH2l6uzC4UEW3iJO4T0qUeUEp_X-f1U7S&index=23

<b> Standard Error and Bootstrapping </b>

Standard error describes how different a sample mean may be from a population mean. This leads into the chi-square analysis in the next section. Standard error is simple to calculate: SE = SD of the sample / sqrt(sample size). When we have several samples and take the mean of their means, we can also get a standard deviation of those means; in this case we refer to it as the standard error. It is useful when quantifying the variation of multiple measures. It gives a sense of variation among groups of samples. You can use bootstrapping to simulate this instead of rerunning the experiment.

Video: https://www.youtube.com/watch?v=XNgt7F6FqDU&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=23

Bootstrapping allows you to determine if a measurement is meaningful without collecting more data, by resampling from the data you already have. Its usefulness is debated depending on the context in which it is being used. For example, one flaw is that it can only confirm what is already in the sample, and it does not account for issues like temporal precedence. However, it is useful when you know what the population distribution should look like, because bootstrapping will allow you to generate more datasets (of any data type).

Video: https://www.youtube.com/watch?v=Xz0x-8-cgaQ&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=25

Video: https://www.youtube.com/watch?v=N4ZQQqyIf6k&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=26

Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784504/

Reading: https://www.linkedin.com/advice/1/what-advantages-disadvantages-bootstrapping-data

In summary, bootstrapping replicates data statistically instead of methodologically. You do this by sampling with replacement, which allows duplicates, so each resample is not exactly the same as the original. Note that a new dataset made from bootstrapping is called a bootstrapped dataset. You can get p-values to compare bootstrapped data (when checking for random error). You can also bootstrap medians, for example if there were outliers (medians are resilient to outliers), or any other metric.
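As a rough sketch of the idea, the code below bootstraps the standard error of the mean and compares it with the SD / sqrt(n) formula, then does the same for the median, which has no equally simple formula. The measurements are invented for illustration and `bootstrap_se` is a hypothetical helper.

```python
import random
from statistics import mean, median, stdev

def bootstrap_se(data, statistic, n_resamples=5000, seed=3):
    """Standard error of a statistic, estimated from bootstrapped datasets."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # sampling with replacement
        estimates.append(statistic(resample))
    return stdev(estimates)

# Hypothetical measurements.
data = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.8, 4.4, 5.2, 5.0, 6.3]

analytic_se = stdev(data) / len(data) ** 0.5
boot_mean_se = bootstrap_se(data, mean)
boot_median_se = bootstrap_se(data, median)

print(f"formula SE of the mean     = {analytic_se:.3f}")
print(f"bootstrap SE of the mean   = {boot_mean_se:.3f}")
print(f"bootstrap SE of the median = {boot_median_se:.3f}")
```

The two estimates for the mean should be close, with the bootstrap value typically a touch smaller in small samples. The same function works for any statistic you can compute on a list.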

"All life is an experiment. The more experiments you make, the better." Ralph Waldo Emerson

<b> Effect Sizes </b>

Cohen (1988) provided a statistic for measuring the amount of "effect" in an analysis. This leads us to our material on General Linear Models (in part 3), but for now we will focus on the concept itself. For Cohen's d, Cohen (1988) suggested that effect sizes near .2 are small, .5 are medium, and .8 are large; d is not capped at 1, so larger values are possible. Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.
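Here is a minimal sketch of Cohen's d using the pooled standard deviation of the two groups. The scores are made up for illustration, and `cohens_d` is a hypothetical function name.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference between two means in pooled standard deviation units."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical scores: the treatment group sits 3 points above the control group.
treatment = [23, 25, 28, 30, 26, 27]
control = [20, 22, 25, 27, 23, 24]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

In this invented example the means are more than one standard deviation apart, so d comes out above 1, which would count as a large effect by Cohen's benchmarks.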

However, there is naturally some debate over how arbitrary these values are and how they should be interpreted (Thompson, 2007). Still, Cohen's d is a popular metric for computing effect size. It is also worth noting the following alternatives, as Cohen's d is often criticized for slightly overestimating effect size. It is common to describe forms of an analysis as liberal or conservative.

Hedges' g: a bias-corrected version of Cohen's d. The correction matters most when samples are small, and it is also commonly recommended when the two samples are not equal in size.

Cohen's f: Best when used in a t-test, one way ANOVA.

Cohen's f^2: Best when used in multiple regression. 

Glass's delta: When SD's are significantly different between the groups.
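The alternatives above can be sketched in a few lines. Hedges' g is shown with the common approximate small-sample correction, Glass's delta standardizes by the control group's SD only, and Cohen's f^2 is computed from a regression R^2 that you would supply. The data and function names are hypothetical.

```python
from statistics import mean, stdev, variance

def hedges_g(group_a, group_b):
    """Cohen's d multiplied by an approximate small-sample correction."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    d = (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return d * correction

def glass_delta(treatment, control):
    """Mean difference in units of the control group's standard deviation."""
    return (mean(treatment) - mean(control)) / stdev(control)

def cohens_f_squared(r_squared):
    """Cohen's f^2 from the R^2 of a regression model."""
    return r_squared / (1 - r_squared)

# Hypothetical scores: the control group is much less spread out.
treatment = [23, 25, 28, 30, 26, 27]
control = [22, 23, 24, 24, 25, 23]

print(f"Hedges' g     = {hedges_g(treatment, control):.2f}")
print(f"Glass's delta = {glass_delta(treatment, control):.2f}")
print(f"Cohen's f^2   = {cohens_f_squared(0.2):.2f}")
```

Because the two groups have very different spreads here, Glass's delta and Hedges' g disagree noticeably, which is the situation Glass's delta is meant for.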

Video: https://www.youtube.com/watch?v=WMTxyWq4E2M

Video 2: https://www.youtube.com/watch?v=sMc5eX4OKbI

Video 3: https://www.youtube.com/watch?v=ISJqVcKZyLs

You may also see Omega Squared, Partial Omega Squared, and Eta Squared. These are effect size measures for ANOVA that measure the strength of association between categorical independent variables and dependent variables.
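For a one-way ANOVA, eta squared and omega squared can both be computed from the sums of squares. The sketch below uses three tiny made-up groups; `anova_effect_sizes` is a hypothetical name, and partial versions for multi-factor designs are not covered.

```python
def anova_effect_sizes(groups):
    """Return (eta squared, omega squared) for a one-way ANOVA layout."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_within = ss_within / df_within
    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq

# Hypothetical scores for three groups.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
eta_sq, omega_sq = anova_effect_sizes(groups)
print(f"eta squared = {eta_sq:.3f}, omega squared = {omega_sq:.3f}")
```

Eta squared is the share of total variation explained by group membership. Omega squared adjusts that share downward a little, so it is usually the more conservative of the two.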

Reading: https://lbecker.uccs.edu/glm_effectsize

Reading: Effect Sizes Overview: Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: univariate and multivariate applications (2nd ed). New York: Routledge.

Reading: Cohen’s d: Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.

Reading: Practical Effect Size Primer: Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. http://doi.org/10.3389/fpsyg.2013.00863

Reading (Advanced): Contrasts Practical Introduction: Rosnow, R. L., & Rosenthal, R. (2009). Effect Sizes: Why, When, and How to Use Them. Zeitschrift Für Psychologie / Journal of Psychology, 217(1), 6–14. http://doi.org/10.1027/0044-3409.217.1.6

Reading (Advanced): Book on Effect Size: Ellis, P. D. (2010). The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. Cambridge ; New York: Cambridge University Press.


"[Statistics are] the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the science of man." Sir Francis Galton

<b> Bonus: Confidence Intervals </b>

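A 95% confidence interval is a range, computed from the sample, that is built so that 95% of such intervals would contain the true population value if we repeated the study many times. A hypothesized value that falls outside the interval would be rejected at p < .05, meaning it is significantly different from our estimate. If two confidence intervals overlap, you should use a test, but if they do not, then the groups are significantly different.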
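One way to see what the "95%" refers to is to simulate many studies and count how often the interval captures the true mean. This sketch assumes normally distributed data with a known standard deviation, so a z-based interval is exact; with an unknown SD you would use a t-based interval instead. All names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for a mean, assuming a known population sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = sum(sample) / len(sample)
    margin = z * sigma / len(sample) ** 0.5
    return center - margin, center + margin

def coverage(true_mean=50.0, sigma=10.0, n=25, n_experiments=4000, seed=4):
    """Share of simulated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        if low <= true_mean <= high:
            covered += 1
    return covered / n_experiments

print(f"share of intervals containing the true mean = {coverage():.3f}")
```

The share should land near .95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure.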

Video: https://www.youtube.com/watch?v=TqOeMYtOc1w&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=24

Reading: Confidence Intervals Explained: Cumming, G., & Finch, S. (2001). A Primer on the Understanding, Use, and Calculation of Confidence Intervals that are Based on Central and Noncentral Distributions. Educational and Psychological Measurement, 61(4), 532–574. http://doi.org/10.1177/0013164401614002

Reading: CI continued: Cumming, G., & Fidler, F. (2009). Confidence Intervals: Better Answers to Better Questions. Zeitschrift Für Psychologie / Journal of Psychology, 217(1), 15–26. http://doi.org/10.1027/0044-3409.217.1.15

Reading: Confidence Intervals from a Bayesian Perspective: Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.


<b> Bonus: Degrees of Freedom </b>

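Degrees of Freedom (df) are used to find critical thresholds or values for establishing p-values. They depend on the sample size and on how many quantities are estimated from the data; for example, a sample standard deviation divides by n-1, rather than the n used for a full population. Computational methods often do this for you.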

<b> Advanced: Sequential Analysis Corrections </b>

This is a post hoc analysis used to correct p-values when there is a sequence of tests being conducted on the data. This may be important because when doing sequential analyses, you may be inflating your error rate. For example, you analyze data during data collection, so the sample size is not fixed. You would want to avoid p-hacking by prematurely analyzing data; however, this is sometimes necessary. While I will not walk through these tests here, I will list some examples.
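The inflation from analyzing data during collection can be seen in a simulation. This sketch assumes a true null, a one-sample z-test with known standard deviation, and five equally spaced looks. It then applies a plain Bonferroni split of alpha across the looks, which is a conservative stand-in and not one of the dedicated sequential boundaries such as Pocock or O'Brien-Fleming. Function names are hypothetical.

```python
import random
from statistics import NormalDist

def significant_at_any_look(looks, alpha, rng):
    """Collect data under a true null and test at each planned sample size."""
    std_normal = NormalDist()
    total, n = 0.0, 0
    for target_n in looks:
        while n < target_n:
            total += rng.gauss(0, 1)
            n += 1
        z = total / n ** 0.5  # z statistic for a mean with sigma = 1
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            return True  # stop early and declare significance
    return False

def error_rate(looks, alpha=0.05, n_experiments=4000, seed=5):
    """Share of null experiments that are declared significant at any look."""
    rng = random.Random(seed)
    hits = sum(
        significant_at_any_look(looks, alpha, rng) for _ in range(n_experiments)
    )
    return hits / n_experiments

single_look = error_rate([100])
five_looks = error_rate([20, 40, 60, 80, 100])
bonferroni_looks = error_rate([20, 40, 60, 80, 100], alpha=0.05 / 5)

print(f"one look at n = 100:            {single_look:.3f}")
print(f"five looks, alpha = .05 each:   {five_looks:.3f}")
print(f"five looks, alpha = .05/5 each: {bonferroni_looks:.3f}")
```

Peeking five times at an uncorrected .05 pushes the overall type 1 error rate well above .05, while splitting alpha across the looks brings it back under the nominal level at the cost of some power.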

Reading: How to conduct sequential analysis: Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023

Bonferroni Correction: The most historical, but many others have built on it. 

Holm-Bonferroni Correction (Sequentially Rejective Bonferroni Test): 

Benjamini-Hochberg Correction:

Pocock Correction: Popular in clinical settings. 

O'Brien-Fleming Correction: Popular in clinical settings.

Haybittle-Peto Correction:  

Sidak Correction:

Tukey Correction:

Holm's Step Procedure:

Hochberg's Step Procedure:

Q-Statistic:
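Three of the corrections listed above adjust a family of p-values and are short enough to sketch: Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg. The first two control the chance of any false positive, while Benjamini-Hochberg controls the expected share of false positives among the significant results. The p-values are made up and the function names are hypothetical; the clinical-trial boundaries (Pocock, O'Brien-Fleming, Haybittle-Peto) work differently and are not shown.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment; never less powerful than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values ordered
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment for the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from largest to smallest p
        i = order[rank]
        candidate = p_values[i] * m / (rank + 1)
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.20]  # hypothetical results from four tests
print("Bonferroni:        ", [round(p, 3) for p in bonferroni(p_values)])
print("Holm:              ", [round(p, 3) for p in holm(p_values)])
print("Benjamini-Hochberg:", [round(p, 3) for p in benjamini_hochberg(p_values)])
```

Each function returns adjusted p-values in the original order, so you compare them to your usual alpha. The adjustments get less strict as you move from Bonferroni to Holm to Benjamini-Hochberg.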

Advanced: If you want a strong understanding of these concepts, I encourage you to skip around Dr. Daniel Lakens's "Improving Your Statistical Inferences" and "Improving Your Statistical Questions" on Coursera.

<b> Conclusion </b>

This is the end of Episode II. Hopefully you have a strong understanding of alpha, beta, p-values, and a few other concepts that are generally important to statistics. You also covered advanced topics like Sequential Analysis Corrections and the underlying philosophy behind p-values. One of the really cool things in statistics is you do not have to have years of experience to dive into the philosophical bases for this material; nonetheless, there are certainly much more complex topics. It is my hope you have been equipped with the tools and terms needed to converse on these things. In Episode III, I will be using GLMs (general linear models) to conduct statistical tests.

Bonus Philosophy of Science Readings:

Reading: Philosophy of Science: Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Reading: PhilSci: Hull, D. L. (2010). Science as a process: an evolutionary account of the social and conceptual development of science. University of Chicago Press. 

Reading: PhilSci: Ladyman, J. (2002). Understanding philosophy of science. London; New York: Routledge.

Reading: PhilSci: Godfrey-Smith, P. (2003). Theory and reality: an introduction to the philosophy of science. Chicago: University of Chicago Press.